MRC Clinical Sciences Centre (http://mrccsc.github.io/)
14/Nov/2016
Correlation
Linear regression
A common task in statistical analysis is to investigate the linear relationship between pairs of numeric vectors.
This can be done by computing the correlation between numeric vectors using the cor() function in R.
In this example we use cor() to compute the Pearson correlation (the default) between two variables; the method argument may be set to use a different correlation method. Because rnorm() draws random values, your exact numbers will differ.
> x <- rnorm(100,10,2)
> z <- rnorm(100,10,2)
> y <- x
> cor(x,y)
[1] 1
> cor(x,-y)
[1] -1
> cor(x,z)
[1] -0.00622123
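The method argument accepts "pearson" (the default), "spearman" or "kendall". A minimal sketch reusing the vectors above; the rank-based methods give the same result here because y is a monotone function of x.
> cor(x,y,method="spearman")
[1] 1
> cor(x,-y,method="kendall")
[1] -1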
In the example below, blood pressure was measured for 15 males by both a machine and an expert. We would like to see whether there is a relationship between the machine-measured and expert-measured blood pressure.
> #install.packages("UsingR")
> library("UsingR")
> data(blood)
> head(blood)
Machine Expert
1 68 72
2 82 84
3 94 89
4 106 100
5 92 97
6 80 88
> cor(blood$Machine,blood$Expert)
[1] 0.9068599
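Two follow-up checks one might run here (a minimal sketch, not part of the original code): visualize the relationship, and use cor.test() to test whether the correlation differs significantly from zero.
> # scatter plot of the two sets of measurements
> plot(blood$Machine,blood$Expert,xlab="Machine",ylab="Expert")
> # test H0: correlation = 0 (returns a p-value and confidence interval)
> cor.test(blood$Machine,blood$Expert)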
Often we wish to apply correlation analysis to all columns or rows of a matrix in a pair-wise manner. To do this in R, we can simply pass the cor() function a single argument: the numeric matrix of interest. The cor() function will then perform all pair-wise correlations between columns.
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> iris4cor<-iris[,1:4]
> # or
> # iris4cor<-iris[,-5]
> cor(iris4cor)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
> pairs(iris4cor)
We have seen how we can find the linear correlation between two sets of variables using the cor() function.
R also provides a comprehensive set of tools for regression analysis, including the widely used linear modelling function lm().
Linear regression fits the line that minimizes the vertical distance between the fitted line and the data points (least squares).
We use the kid.weights dataset as an example and see whether we can use kids' height to predict kids' weight.
> #install.packages("UsingR")
> #library("UsingR")
> data(kid.weights)
> str(kid.weights)
'data.frame': 250 obs. of 4 variables:
$ age : num 58 103 87 138 82 52 28 79 107 45 ...
$ weight: num 38 87 50 98 47 30 24 45 144 24 ...
$ height: num 38 43 48 61 47 24 29 48 59 24 ...
$ gender: Factor w/ 2 levels "F","M": 2 2 2 2 1 1 2 1 2 2 ...
> # or using head() function
> head(kid.weights)
age weight height gender
1 58 38 38 M
2 103 87 43 M
3 87 50 48 M
4 138 98 61 M
5 82 47 47 F
6 52 30 24 F
The lm() function fits a linear regression to your data and provides useful information on the generated fit.
In the example below we fit a linear model using lm() on the kid.weights dataset with weight (Y) as the dependent variable and height (X) as the explanatory variable.
> lmResult<-lm(formula = weight ~ height, data = kid.weights)
> lmResult
Call:
lm(formula = weight ~ height, data = kid.weights)
Coefficients:
(Intercept) height
-31.342 1.909
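As a sanity check (a minimal sketch, not part of the original code), the least-squares estimates can be reproduced from the sample covariance and variance: the slope is cov(X,Y)/var(X) and the intercept is mean(Y) - slope * mean(X).
> # slope: sample covariance of height and weight over variance of height
> b1<-cov(kid.weights$height,kid.weights$weight)/var(kid.weights$height)
> # intercept: forces the line through the point (mean(X), mean(Y))
> b0<-mean(kid.weights$weight)-b1*mean(kid.weights$height)
> c(b0,b1)
[1] -31.341912   1.909044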
As we have seen, printing the model result provides the intercept and slope of the line. To get more information on the model we can use the summary() function.
> summary(lmResult)
Call:
lm(formula = weight ~ height, data = kid.weights)
Residuals:
Min 1Q Median 3Q Max
-24.837 -7.361 -2.384 3.616 68.708
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.34191 3.17493 -9.872 <2e-16 ***
height 1.90904 0.08343 22.882 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.09 on 248 degrees of freedom
Multiple R-squared: 0.6786, Adjusted R-squared: 0.6773
F-statistic: 523.6 on 1 and 248 DF, p-value: < 2.2e-16
> lmResult$coefficients
(Intercept) height
-31.341912 1.909044
From the $coefficients of object lmResult, we know the equation for the best fit is
\[ Y = -31.342 + 1.909X \]
\[ f(x) = b_0 + b_1x \]
\[ b_0\text{: the value of f(x) when x = 0} \]
# the intercept (-31.342) is the expected weight of a kid with a height of 0 inches
# this is not interesting for any biological question
\[ b_1\text{: the amount f(x) changes when x changes by 1 unit} \]
# for every additional inch of height, we expect a kid's weight to increase by about 1.9 pounds
Predicting kids' weights from their heights.
If we have 3 kids with heights of 23, 40 and 66 inches, how do we predict their weights?
We can use the information from $coefficients:
> new_kid_height<-c(23,40,66)
> beta0<-lmResult$coefficients[1]
> beta1<-lmResult$coefficients[2]
>
> predicted_new_kid_weight<-beta0+beta1*new_kid_height
> predicted_new_kid_weight
[1] 12.56609 45.01984 94.65497
Or use the predict() function:
> new_kid_height_df <- data.frame(height=c(23,40,66))
> clever_predicted_kid_weight<-predict(lmResult,new_kid_height_df)
> clever_predicted_kid_weight
1 2 3
12.56609 45.01984 94.65497
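predict() can also return uncertainty around each prediction via its interval argument (a minimal sketch; "confidence" gives an interval for the mean response, "prediction" for a new observation):
> predict(lmResult,new_kid_height_df,interval="prediction")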
The residuals are the differences between the predicted and actual values. To retrieve the residuals we can access the residuals slot of the model object or use the resid() function.
> summary(resid(lmResult))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-24.840 -7.361 -2.384 0.000 3.616 68.710
> summary(lmResult$residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-24.840 -7.361 -2.384 0.000 3.616 68.710
Ideally you would want your residuals to be normally distributed around 0.
\[ E[e_{i}]=0 \]
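A minimal sketch of two standard visual checks (not part of the original code): a histogram of the residuals and a normal Q-Q plot, both of which should look symmetric around 0 if the assumption holds.
> # histogram of residuals, centred near 0 for a good fit
> hist(resid(lmResult),xlab="residuals")
> # Q-Q plot: points should fall close to the reference line
> qqnorm(resid(lmResult)); qqline(resid(lmResult))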
Plot the data with the fitted regression line:
plot(kid.weights$height,kid.weights$weight,ylim=c(-30,150),
ylab="weight (pounds)",xlab="height (inches)")
abline(lmResult,col="blueviolet",lwd=3, lty=1)
A residual is the vertical distance between an observed data point and the regression line. It has the same unit as the dependent variable.
The sum of the squared residuals (SSE) measures the residual variability: the variability that cannot be explained by the regression model.
\[ Error_i = y_i - \hat{y}_i \\ y_i\text{: the observed weight of the ith kid} \\ \hat{y}_i\text{: the predicted weight of the ith kid} \\ Error_i^2 = (y_i - \hat{y}_i)^2 \\ \text{- sum of the squared residuals (SSE)} \\ SSE = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \]
Plotting the residuals against the independent variable (X), i.e. the height, makes it easier to assess the residuals by eye.
plot(kid.weights$height,lmResult$residuals,ylim=c(-30,150),
ylab="residuals (weight in pounds)",xlab="height (inches)")
abline(h=0,col="blueviolet",lwd=3, lty=1)
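R's built-in diagnostics offer a similar view (a minimal sketch, not part of the original code); which = 1 plots residuals against fitted values with a smoothed trend line.
# first of the standard lm diagnostic plots
plot(lmResult, which = 1)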
The R-squared value represents the proportion of variability in the response variable that is explained by the explanatory variable.
A high R-squared indicates that the line fits the data closely.
> summary(lmResult)$r.squared
[1] 0.6785746
> kid.weights$weight
[1] 38 87 50 98 47 30 24 45 144 24 40 15 20 10 19 29 80
[18] 20 28 24 35 30 52 72 28 14 18 11 31 28 31 14 38 15
[35] 27 60 40 40 21 47 46 30 69 76 16 11 38 25 29 21 36
[52] 28 41 13 21 33 40 32 27 20 30 45 34 19 41 40 19 14
[69] 43 90 45 29 105 94 22 69 17 27 12 40 25 13 20 16 45
[86] 27 20 54 60 30 18 55 14 34 45 48 30 25 38 13 22 19
[103] 74 30 89 35 33 78 43 40 32 31 15 19 55 40 37 34 45
[120] 44 50 50 28 32 22 52 40 42 22 23 16 17 30 40 32 26
[137] 30 61 65 52 23 29 18 32 50 13 36 47 80 23 64 45 25
[154] 65 26 38 20 150 42 53 35 16 150 25 30 52 27 27 22 30
[171] 35 42 12 47 14 32 25 93 30 59 100 34 52 43 32 13 18
[188] 38 17 46 29 50 14 13 26 11 32 70 16 50 26 23 37 60
[205] 45 26 131 22 12 70 86 35 14 85 58 28 17 34 19 13 18
[222] 23 17 30 41 30 13 125 32 18 35 32 108 40 42 28 26 26
[239] 23 42 28 55 65 80 49 18 14 26 85 14
Question: How would you describe (or summarize) kids' weight when the height information is absent?
The mean might be a good choice.
> mean(kid.weights$weight)
[1] 38.384
R-squared is the fraction of variability in the dependent variable (Y, the weight in this example) that can be explained by the explanatory variable (X, the height in this example).
\[ TSS=\text{Total Sum of Squares}=\sum_{i=1}^n(y_i-\overline y)^2 \\ SSE=\text{Sum of the Squared Residuals}=\sum_{i=1}^n(y_i-\hat{y}_i)^2 \]
> SSE<-sum(lmResult$residuals^2)
> TSS<-sum((kid.weights$weight - mean(kid.weights$weight))^2)
> R_square<-1-(SSE/TSS)
> R_square
[1] 0.6785746
> summary(lmResult)$r.squared
[1] 0.6785746
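The Adjusted R-squared reported by summary() additionally penalizes for the number of explanatory variables p (here p = 1); a minimal sketch of the formula 1 - (1 - R^2)(n - 1)/(n - p - 1):
> n<-nrow(kid.weights)
> 1-(1-R_square)*(n-1)/(n-1-1)   # matches summary()'s Adjusted R-squared, 0.6773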
Residuals from the model
Residuals from the mean: assuming the independent variable (X), i.e. height in our case, does not exist
\[ \begin{aligned} TSS = \sum_{i=1}^{n}(y_i-\overline y)^2 \end{aligned} \]
The results from linear models also provide a measure of significance: the F-statistic tests whether the explanatory variable is relevant, i.e. whether it explains more variance than expected by chance.
> summary(lmResult)$fstatistic
value numdf dendf
523.5631 1.0000 248.0000
\[ F=\frac{MSM}{MSE}=\frac{\text{mean of the explained variance}}{\text{mean of the unexplained variance}}=\frac{({\displaystyle\frac{SSM}1})}{({\displaystyle\frac{SSE}{n-2}})} \]
> n <- nrow(kid.weights)
> SSM <- sum((predict(lmResult) - mean(kid.weights$weight))^2)
> MSE <-sum(lmResult$residuals^2)/(n-2)
>
> MSM <-SSM/1
>
> MSM/MSE
[1] 523.5631
> summary(lmResult)$fstatistic
value numdf dendf
523.5631 1.0000 248.0000
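The p-value reported alongside the F-statistic comes from the F distribution with (numdf, dendf) degrees of freedom; a minimal sketch using pf():
> f<-summary(lmResult)$fstatistic
> pf(f[1],f[2],f[3],lower.tail=FALSE)   # < 2.2e-16, as reported by summary()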
As an example, we use Petal.Width to predict Petal.Length in the iris data.
We will focus on the species versicolor.
> iris_versi<-iris[iris$Species=="versicolor", c("Petal.Length","Petal.Width")]
> dim(iris_versi)
[1] 50 2
> head(iris_versi)
Petal.Length Petal.Width
51 4.7 1.4
52 4.5 1.5
53 4.9 1.5
54 4.0 1.3
55 4.6 1.5
56 4.5 1.3
If we only know the Petal.Length values (and have no other information), how would we predict the Petal.Length of a new flower?
> head(iris_versi[,"Petal.Length"])
[1] 4.7 4.5 4.9 4.0 4.6 4.5
> PetalLen.mean<-mean(iris_versi$Petal.Length)
> PetalLen.mean
[1] 4.26
> plot(iris_versi$Petal.Length,ylab="Petal Length of Iris.versicolor")
> abline(h=PetalLen.mean, col="forestgreen",lwd=3)
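To visualize the errors themselves, one could add the vertical distance from each point to the mean line (a minimal sketch, not part of the original code):
> # vertical segments from each observation up/down to the mean line
> segments(x0=seq_along(iris_versi$Petal.Length),y0=PetalLen.mean,
+          y1=iris_versi$Petal.Length,col="grey")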
\[ \text{In this case, the expected value is the mean } = \overline y \]
\[ \begin{aligned} Error_i & = y_i - \overline y \end{aligned} \]
\[ \begin{aligned} Error_i^2 = (y_i - \overline y)^2 \end{aligned} \]
\[ \begin{aligned} SSE = \sum_{i=1}^{n}(y_i-\overline y)^2 \end{aligned} \]
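For comparison, a minimal sketch of the model the exercise asks for, predicting Petal.Length from Petal.Width; the SSE around the fitted line should be smaller than the SSE around the mean (the TSS):
> irisLm<-lm(Petal.Length~Petal.Width,data=iris_versi)
> sum(resid(irisLm)^2)                              # SSE around the fitted line
> sum((iris_versi$Petal.Length-PetalLen.mean)^2)    # TSS, the SSE around the mean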
Exercises for this session can be found here.
Answers can be found here.
R code for the solutions can be found here.